Working with data files

Reading and writing data files is a common task, and Python offers native support for working with many kinds of data files. Today, we're going to be working mainly with CSVs.

Import the csv module

We're going to be working with delimited text files, so the first thing we need to do is import this functionality from the standard library.


In [ ]:
import csv

Opening a file to read the contents

We're going to use something called a with statement to open a file and read the contents. The open() function takes at least two arguments: The path to the file you're opening and what "mode" you're opening it in.

To start with, we're going to use the 'r' mode to read the data. We'll use the default arguments for delimiter -- comma -- and we don't need to specify a quote character.

Important: If you open a data file in w (write) mode, anything that's already in the file will be erased.

The file we're using -- MLB roster data from 2017 -- lives at data/mlb.csv.

Once we have the file open, we're going to use some functionality from the csv module to iterate over the lines of data and print each one.

Specifically, we're going to use the csv.reader method, which returns a list of lines in the data file. Each line, in turn, is a list of the "cells" of data in that line.

Then we're going to loop over the lines of data and print each line. We can also use bracket notation to retrieve elements from inside each line of data.


In [ ]:
# open the MLB data file `as` mlb
with open('data/mlb.csv', 'r') as mlb:
    
    # create a reader object
    reader = csv.reader(mlb)
    
    # loop over the rows in the file
    for row in reader:
        
        # assign variables to each element in the row (shortcut!)
        name, team, position, salary, start_year, end_year, years = row
        
        # print the row, which is a list
        print(row)

Simple filtering

If you wanted to filter your data, you could use an if statement inside your with block.


In [ ]:
# open the MLB data file `as` mlb
with open('data/mlb.csv', 'r') as mlb:
    
    # create a reader object
    reader = csv.reader(mlb)

    # move past the header row
    next(reader)
    
    # loop over the rows in the file
    for row in reader:

        # assign variables to each element in the row (shortcut!)
        name, team, position, salary, start_year, end_year, years = row
        
        # print the line of data ~only~ if the player is on the Twins
        if team == 'MIN':
        
            # print the row, which is a list
            print(row)

Exercise

Read in the MLB data, print only the names and salaries of players who make at least $1 million. (Hint: Use type coercion!)


In [ ]:
# open the MLB data file `as` mlb
with open('data/mlb.csv', 'r') as mlb:
    
    # create a reader object
    reader = csv.reader(mlb)
    
    # move past the header row
    next(reader)
    
    # loop over the rows in the file
    for row in reader:

        # assign variables to each element in the row (shortcut!)
        name, team, position, salary, start_year, end_year, years = row
        
        # print the line of data ~only~ if the player is on the Twins
        if int(salary) >= 1000000:
        
            # print the row, which is a list
            print(name, salary)

DictReader: Another way to read CSV files

Sometimes it's more convenient to work with data files as a list of dictionaries instead of a list of lists. That way, you don't have to remember the position of each "column" of data -- you can just reference the column name. To do it, we'll use a csv.DictReader object instead of a csv.reader object. Otherwise the code is much the same.


In [ ]:
# open the MLB data file `as` mlb
with open('data/mlb.csv', 'r') as mlb:
    
    # create a reader object
    reader = csv.DictReader(mlb)
    
    # loop over the rows in the file
    for row in reader:

        # print just the player's name (the column header is "NAME")
        print(row['NAME'])

Writing to CSV files

You can also use the csv module to create csv files -- same idea, you just need to change the mode to 'w'. As with reading, there's a list-based writing method and a dictionary-based method.


In [ ]:
# define the column names
COLNAMES = ['name', 'org', 'position']

# let's make a few rows of data to write
DATA_TO_WRITE = [
    ['Cody', 'IRE', 'Training Director'],
    ['Maggie', 'The New York Times', 'Reporter'],
    ['Donald', 'The White House', 'President']
]

# open an output file in write mode
with open('people-list.csv', 'w') as outfile:
    
    # create a writer object
    writer = csv.writer(outfile)
    
    # write the header row
    writer.writerow(COLNAMES)
    
    # loop over the data and write to file
    for human in DATA_TO_WRITE:
        writer.writerow(human)

Using DictWriter to write data

Similar to using the list-based method, except that you need to ensure that the keys in your dictionaries of data match exactly a list of fieldnames.


In [ ]:
# define the column names
COLNAMES = ['name', 'org', 'position']

# let's make a few rows of data to write
DATA_TO_WRITE = [
    {'name': 'Cody', 'org': 'IRE', 'position': 'Training Director'},
    {'name': 'Maggie', 'org': 'The New York Times', 'position': 'Reporter'},
    {'name': 'Donald', 'org': 'The White House', 'position': 'President'}
]

# open an output file in write mode
with open('people-dict.csv', 'w') as outfile:
    
    # create a writer object -- pass the list of column names to the `fieldnames` keyword argument
    writer = csv.DictWriter(outfile, fieldnames=COLNAMES)
    
    # use the writeheader method to write the header row
    writer.writeheader()
    
    # loop over the data and write to file
    for human in DATA_TO_WRITE:
        writer.writerow(human)

You can open multiple files for reading/writing

Sometimes you want to open multiple files at the same time. One thing you might want to do: Opening a file of raw data in read mode, clean each row in a loop and write out the clean data to a new file.

You can open multiple files in the same with block -- just separate your open() functions with a comma.

For this example, we're not going to do any cleaning -- we're just going to copy the contents of one file to another.


In [ ]:
# open the MLB data file `as` mlb
# also, open `mlb-copy.csv` to write to
with open('data/mlb.csv', 'r') as mlb, open('mlb-copy.csv', 'w') as mlb_copy:
    
    # create a reader object
    reader = csv.DictReader(mlb)
    
    # create a writer object
    # we're going to use the `fieldnames` attribute of the DictReader object
    # as our output headers, as well
    # b/c we're basically just making a copy
    writer = csv.DictWriter(mlb_copy, fieldnames=reader.fieldnames)
    
    # write header row
    writer.writeheader()
    
    # loop over the rows in the file
    for row in reader:
        
        # what type of object is `row`?
        # how would we find out?
        
        # write row to output file
        writer.writerow(row)